POS error detection in automatically annotated corpora
Abstract
Recent work on error detection has shown that the quality of manually annotated corpora can be substantially improved by applying consistency checks to the data and automatically identifying incorrectly labelled instances. These methods, however, cannot be used for automatically annotated corpora, where errors are systematic and cannot easily be identified by looking at the variance in the data. This paper targets the detection of POS errors in automatically annotated corpora, so-called silver standards, showing that by combining different measures sensitive to annotation quality we can identify a large part of the errors and obtain a substantial increase in accuracy.
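The abstract does not say which quality measures are combined, so the following is only a minimal illustrative sketch, not the paper's method. It assumes two hypothetical measures: agreement between several taggers on the silver tag, and the confidence of the tagger that produced the silver tag. The function name `flag_suspicious_tokens` and both thresholds are made up for illustration.

```python
from collections import Counter


def flag_suspicious_tokens(predictions, confidences,
                           min_agreement=1.0, min_confidence=0.9):
    """Return indices of tokens whose silver POS tag looks unreliable.

    predictions: list of tag sequences, one per tagger, aligned token by token.
                 The first sequence is treated as the silver standard (assumption).
    confidences: per-token confidence of the first tagger (assumption).
    """
    n_taggers = len(predictions)
    flagged = []
    for i, conf in enumerate(confidences):
        tags = [seq[i] for seq in predictions]
        silver = tags[0]
        # Measure 1 (assumed): share of taggers that agree with the silver tag.
        agreement = Counter(tags)[silver] / n_taggers
        # Measure 2 (assumed): confidence of the tagger that produced the silver tag.
        if agreement < min_agreement or conf < min_confidence:
            flagged.append(i)
    return flagged


# Toy input: three taggers on the tokens "The can rusts".
predictions = [
    ["DT", "MD", "VBZ"],  # silver standard
    ["DT", "NN", "VBZ"],
    ["DT", "NN", "VBZ"],
]
confidences = [0.99, 0.55, 0.80]
print(flag_suspicious_tokens(predictions, confidences))
```

In this toy input the second token is flagged because the other taggers disagree with the silver tag, and the third because the silver tagger's confidence is below the threshold. Flagged tokens would then go to manual review or a separate correction step.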
Similar resources
Automatic Error Detection in Annotated Corpora
An annotated corpus is a linguistic resource which explicitly encodes the information at syntactic and semantic levels for each sentence. Annotated corpora play a crucial role in many applications of natural language processing (NLP). Error-free and consistent annotated corpora are vital for these applications. Creating annotated corpora is an expensive and time-consuming process. Errors or anomali...
From Detecting Errors to Automatically Correcting Them
Faced with the problem of annotation errors in part-of-speech (POS) annotated corpora, we develop a method for automatically correcting such errors. Building on top of a successful error detection method, we first try correcting a corpus using two off-the-shelf POS taggers, based on the idea that they enforce consistency; with this, we find some improvement. After some discussion of the tagging...
Feature-Rich Part-Of-Speech Tagging Using Deep Syntactic and Semantic Analysis
This paper describes the implementation, improvement and evaluation of the machine translation (MT) system proposed by Jackov (2014) when used as a feature-rich part-of-speech (POS) tagger for Bulgarian. The system does not rely on POS tagging for morphological disambiguation. Instead, all ambiguities are considered in parsing hypotheses that are scored and the best one is used for tagging. The ...
An Annotated Corpus Management Tool: ChaKi
Large-scale annotated corpora are very important not only in linguistic research but also in practical natural language processing tasks, since a number of practical tools such as part-of-speech (POS) taggers and syntactic parsers are now corpus-based or machine learning-based systems which require some amount of accurately annotated corpora. This article presents an annotated corpus management t...
Detecting Errors in Corpora Using Support Vector Machines
While corpus-based research relies on human-annotated corpora, it is often said that a non-negligible number of errors remain even in frequently used corpora such as the Penn Treebank. Detection of errors in annotated corpora is important for corpus-based natural language processing. In this paper, we propose a method to detect errors in corpora using support vector machines (SVMs). This method...